The differential entropy of the posterior is defined as:
$$ H(y \mid \mathbf{x}) = E[I(y \mid \mathbf{x})] = -\int_Y P(y \mid \mathbf{x}) \log P(y \mid \mathbf{x}) \, dy $$

In order to determine which point to examine next, we want to minimize the expected entropy of the posterior distribution after examining that point. Simply picking the highest-scoring point every time is computationally expensive and can give poor results, so instead we sample stochastically.
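As a rough sketch of that selection step: score each candidate by the differential entropy of its posterior predictive and draw the next point in proportion to a softmax over the scores, rather than always taking the argmax. The candidate inputs, predictive standard deviations, and the softmax rule below are illustrative assumptions, and predictive entropy is only a proxy for the expected reduction in posterior entropy.

In [ ]:
import numpy as np
import scipy.stats

# Hypothetical candidate inputs and the standard deviation of the posterior
# predictive at each of them (in practice these would come from the model).
candidates = np.linspace(0.0, 1.0, 5)
pred_sigma = np.array([0.10, 0.40, 0.80, 0.30, 0.05])

# Differential entropy of a Gaussian predictive: 0.5 * log(2 * pi * e * sigma^2).
scores = scipy.stats.norm.entropy(loc=0, scale=pred_sigma)

# Sample the next point stochastically (softmax over the scores) instead of
# greedily taking the argmax.
weights = np.exp(scores - scores.max())
probs = weights / weights.sum()
next_x = np.random.choice(candidates, p=probs)
print(next_x, probs)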
The KL divergence \cite{box1967discrimination} (p. 62) is perhaps the most useful metric here: it measures the difference in information between the old posterior and the new posterior.
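As a quick, made-up example: discretize an old and a new posterior on a grid and hand them to scipy.stats.entropy(p, q), which returns the KL divergence D(p || q) for two normalized probability vectors. The two Gaussians below are purely hypothetical.

In [ ]:
import numpy as np
import scipy.stats

grid = np.linspace(-6, 6, 1201)
old_post = scipy.stats.norm.pdf(grid, loc=0.0, scale=2.0)   # hypothetical old posterior
new_post = scipy.stats.norm.pdf(grid, loc=0.5, scale=1.0)   # hypothetical new posterior

# Normalize on the grid so the densities act like discrete distributions.
old_post /= old_post.sum()
new_post /= new_post.sum()

# D(new || old): how much information the new observation bought us.
print(scipy.stats.entropy(new_post, old_post))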
In [25]:
import scipy.stats

# Differential entropy of a normal distribution: 0.5 * log(2 * pi * e * sigma^2).
print(scipy.stats.norm.entropy(0, 1))       # standard normal: ~1.419 nats
print(scipy.stats.norm.entropy(0, 2))       # doubling the scale adds log(2) ~ 0.693 nats

# Differential entropy of a uniform distribution on [loc, loc + scale]: log(scale).
print(scipy.stats.uniform.entropy(0, 0.5))  # log(0.5) ~ -0.693; differential entropy can be negative
print(scipy.stats.uniform.entropy(0, 1))    # log(1) = 0
print(scipy.stats.uniform.entropy(0, 2))    # log(2) ~ 0.693
Box optimizes for information gain over the class of candidate models instead. This is computationally simpler, but less appropriate for open-ended problems.
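A simplified sketch of that idea, with made-up models and data: after one observation, the information gained about which candidate model is correct is the KL divergence from the prior over models to the posterior over models. (Box's criterion works with the expected gain before observing; this only computes the realized gain.)

In [ ]:
import numpy as np
import scipy.stats

# Hypothetical candidate models for an observable y, with a uniform prior.
models = [scipy.stats.norm(0, 1), scipy.stats.norm(1, 1), scipy.stats.norm(0, 3)]
prior = np.full(len(models), 1 / len(models))

y_obs = 0.8                                    # hypothetical new observation
likelihoods = np.array([m.pdf(y_obs) for m in models])
posterior = prior * likelihoods
posterior /= posterior.sum()

# Information gained about which model is correct: D(posterior || prior).
print(scipy.stats.entropy(posterior, prior))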